Discussion 01: FIFA Analytics¶
For more info about the data items, you may take a look at: https://sofifa.com/
From marketing, business, and sports management perspectives, this dataset can be highly valuable. Numerous research questions can be generated. For example:
- Sports tech service firms may be interested in analyzing the players' traits and then customize their services.
- Clubs may want to know if their players' characteristics are significantly different than some other major "competitors".
- Sports analysts may want to investigate if certain countries have significantly unbalanced player types.
- Owners of the clubs want to know what type of players may be more associated with performance and club rating and therefore the clubs may adjust their plan to recruit future players.
- International sports product firms may analyze players' patterns and traits across countries and therefore adjust their international marketing focuses and strategies.
- Clubs may want to probe if there are excessive gaps between players' wage and value and therefore the clubs may adjust future assets allocation.
- Firms such as NIKE may be interested in mapping out the players' skills and design its different product lines.
Coaches may want to know different skill sets across players from different countries, different age groups, and other categories.
- Many more.....
We use this dataset to refresh your knowledge about R (or any analysis tool you prefer) and basic data analysis.
Please try to complete following tasks and generate insights from your analysis.
You will:
- Observe and examine the data structure.
- Visually display the descriptive statistics of the variables you are interested in.
- Choose appropriate data analysis methods and examine the relationships between or among the variables that interest you. (I encourage you to use multiple methods or approaches to explore the data.)
Based on your analysis, please generate some insights.
I. Load the Environment¶
# Import standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Load / inspect the data
fifa = pd.read_csv("data/Fifa_Dataset.csv", parse_dates=["birth_date", "club_join_date"])
fifa.head(10)
| id | name | full_name | birth_date | age | height_cm | weight_kgs | positions | nationality | overall_rating | ... | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | L. Messi | Lionel Andrés Messi Cuccittini | 1987-06-24 | 31 | 170.18 | 72.1 | CF,RW,ST | Argentina | 94 | ... | 64+2 | 61+2 | 61+2 | 61+2 | 64+2 | 59+2 | 48+2 | 48+2 | 48+2 | 59+2 |
| 1 | 190460 | C. Eriksen | Christian Dannemann Eriksen | 1992-02-14 | 27 | 154.94 | 76.2 | CAM,RM,CM | Denmark | 88 | ... | 71+3 | 71+3 | 71+3 | 71+3 | 71+3 | 66+3 | 57+3 | 57+3 | 57+3 | 66+3 |
| 2 | 195864 | P. Pogba | Paul Pogba | 1993-03-15 | 25 | 190.50 | 83.9 | CM,CAM | France | 88 | ... | 76+3 | 77+3 | 77+3 | 77+3 | 76+3 | 74+3 | 72+3 | 72+3 | 72+3 | 74+3 |
| 3 | 198219 | L. Insigne | Lorenzo Insigne | 1991-06-04 | 27 | 162.56 | 59.0 | LW,ST | Italy | 88 | ... | 63+3 | 58+3 | 58+3 | 58+3 | 63+3 | 58+3 | 44+3 | 44+3 | 44+3 | 58+3 |
| 4 | 201024 | K. Koulibaly | Kalidou Koulibaly | 1991-06-20 | 27 | 187.96 | 88.9 | CB | Senegal | 88 | ... | 73+3 | 77+3 | 77+3 | 77+3 | 73+3 | 76+3 | 85+3 | 85+3 | 85+3 | 76+3 |
| 5 | 203376 | V. van Dijk | Virgil van Dijk | 1991-07-08 | 27 | 193.04 | 92.1 | CB | Netherlands | 88 | ... | 78+3 | 82+3 | 82+3 | 82+3 | 78+3 | 80+3 | 86+3 | 86+3 | 86+3 | 80+3 |
| 6 | 231747 | K. Mbappé | Kylian Mbappé | 1998-12-20 | 20 | 152.40 | 73.0 | RW,ST,RM | France | 88 | ... | 66+3 | 62+3 | 62+3 | 62+3 | 66+3 | 62+3 | 54+3 | 54+3 | 54+3 | 62+3 |
| 7 | 153079 | S. Agüero | Sergio Leonel Agüero del Castillo | 1988-06-02 | 30 | 172.72 | 69.9 | ST | Argentina | 89 | ... | 58+3 | 56+3 | 56+3 | 56+3 | 58+3 | 53+3 | 47+3 | 47+3 | 47+3 | 53+3 |
| 8 | 167495 | M. Neuer | Manuel Neuer | 1986-03-27 | 32 | 193.04 | 92.1 | GK | Germany | 89 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 179813 | E. Cavani | Edinson Roberto Cavani Gómez | 1987-02-14 | 32 | 185.42 | 77.1 | ST | Uruguay | 89 | ... | 67+3 | 65+3 | 65+3 | 65+3 | 67+3 | 65+3 | 63+3 | 63+3 | 63+3 | 65+3 |
10 rows × 92 columns
# Check data structure
print(fifa.shape, "\n")
fifa.info()
(17954, 92) <class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 92 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 17954 non-null int64 1 name 17954 non-null object 2 full_name 17954 non-null object 3 birth_date 17954 non-null datetime64[ns] 4 age 17954 non-null int64 5 height_cm 17954 non-null float64 6 weight_kgs 17954 non-null float64 7 positions 17954 non-null object 8 nationality 17954 non-null object 9 overall_rating 17954 non-null int64 10 potential 17954 non-null int64 11 value_euro 17699 non-null float64 12 wage_euro 17708 non-null float64 13 preferred_foot 17954 non-null object 14 international_reputation(1-5) 17954 non-null int64 15 weak_foot(1-5) 17954 non-null int64 16 skill_moves(1-5) 17954 non-null int64 17 work_rate 17954 non-null object 18 body_type 17954 non-null object 19 release_clause_euro 16117 non-null float64 20 club_team 17940 non-null object 21 club_rating 17940 non-null float64 22 club_position 17940 non-null object 23 club_jersey_number 17940 non-null float64 24 club_join_date 16018 non-null datetime64[ns] 25 contract_end_year 17593 non-null object 26 national_team 857 non-null object 27 national_rating 857 non-null float64 28 national_team_position 857 non-null object 29 national_jersey_number 857 non-null float64 30 crossing 17954 non-null int64 31 finishing 17954 non-null int64 32 heading_accuracy 17954 non-null int64 33 short_passing 17954 non-null int64 34 volleys 17954 non-null int64 35 dribbling 17954 non-null int64 36 curve 17954 non-null int64 37 freekick_accuracy 17954 non-null int64 38 long_passing 17954 non-null int64 39 ball_control 17954 non-null int64 40 acceleration 17954 non-null int64 41 sprint_speed 17954 non-null int64 42 agility 17954 non-null int64 43 reactions 17954 non-null int64 44 balance 17954 non-null int64 45 shot_power 17954 non-null int64 46 jumping 17954 non-null int64 47 stamina 17954 non-null int64 48 strength 17954 non-null int64 49 long_shots 17954 non-null int64 50 aggression 17954 non-null int64 51 interceptions 17954 non-null int64 52 positioning 17954 non-null int64 53 vision 17954 non-null int64 54 penalties 17954 non-null int64 55 composure 17954 non-null int64 56 marking 17954 non-null int64 57 standing_tackle 17954 non-null int64 58 sliding_tackle 17954 non-null int64 59 GK_diving 17954 non-null int64 60 GK_handling 17954 non-null int64 61 GK_kicking 17954 non-null int64 62 GK_positioning 17954 non-null int64 63 GK_reflexes 17954 non-null int64 64 tags 1417 non-null object 65 traits 8137 non-null object 66 LS 15889 non-null object 67 ST 15889 non-null object 68 RS 15889 non-null object 69 LW 15889 non-null object 70 LF 15889 non-null object 71 CF 15889 non-null object 72 RF 15889 non-null object 73 RW 15889 non-null object 74 LAM 15889 non-null object 75 CAM 15889 non-null object 76 RAM 15889 non-null object 77 LM 15889 non-null object 78 LCM 15889 non-null object 79 CM 15889 non-null object 80 RCM 15889 non-null object 81 RM 15889 non-null object 82 LWB 15889 non-null object 83 LDM 15889 non-null object 84 CDM 15889 non-null object 85 RDM 15889 non-null object 86 RWB 15889 non-null object 87 LB 15889 non-null object 88 LCB 15889 non-null object 89 CB 15889 non-null object 90 RCB 15889 non-null object 91 RB 15889 non-null object dtypes: datetime64[ns](2), float64(9), int64(41), object(40) memory usage: 12.6+ MB
II. EDA¶
# Count missing values by column, excluding full features, and sort by the number of missing entries
missing_all = (
fifa.isna().sum()
.reset_index()
.rename(columns={0: "cnt_na_vals"})
.query("cnt_na_vals != 0")
.sort_values(by="cnt_na_vals", ascending=False)
.reset_index(drop=True)
)
# Visualize the 20 least-populated features
twenty_sparsest = missing_all.head(20)
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(data=twenty_sparsest, x="index", y="cnt_na_vals", ax=ax)
plt.xticks(rotation=90)
plt.xlabel("Features")
plt.ylim(0, 18000)
plt.ylabel("Missing Values")
plt.title("Figure 01: Top 20 Sparsest Columns")
plt.tight_layout()
plt.show()
So, from the above, I can see that national_team, national_team_position, national_jersey_number, and national_rating are all 90% sparse. This makes sense, since only a handful of elites would be chosen to play for the national team. However, it means that in the case of any Machine Learning, these features would have to be dropped from consideration. Lets subset those populated national entries for later analysis, then move on.
tags and traits would also likely require consideration, given their sparsity.
# Subset players with National Records.
national = fifa.loc[:, ["id", "national_team", "national_team_position", "national_jersey_number", "national_rating"]].dropna()
national_subset = fifa.loc[fifa["id"].isin(national["id"])]
# Visualize age distribution
sns.histplot(data=fifa, x="age",
binwidth=5,
bins=range(17, 51, 5),
color="orange")
plt.xticks(range(17, 52, 5))
plt.xlabel("Age")
plt.ylabel("Player Count")
plt.title("Figure 02: FIFA Players Age Distribution")
plt.tight_layout()
plt.show()
# Examine physical features: height, weight, and strength
sns.jointplot(data=fifa,
x="weight_kgs",
y="height_cm",
hue="strength",
palette=sns.cubehelix_palette(as_cmap=True),
height=8)
plt.xlim(45, 115)
plt.ylim(150, 210)
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.title("Figure 03: FIFA Players: Size vs. Strength")
plt.tight_layout()
plt.show()
# Check weight vs. strength
sns.regplot(data=fifa,
x="weight_kgs",
y="strength",
marker="o",
line_kws=dict(color="red"))
plt.xlabel("Weight (kg)")
plt.ylabel("Strength")
plt.title("Figure 04: FIFA Players: Weight vs. Strength")
plt.tight_layout()
plt.show()
# Subset Correlation
physical_corr = fifa.loc[:, ["height_cm", "weight_kgs", "strength"]].corr()
mask = np.triu(np.ones_like(physical_corr, dtype=bool))
sns.heatmap(data=physical_corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f", mask=mask)
plt.title("Figure 05: Correlations among Height, Weight, and Strength")
plt.tight_layout()
plt.show()
While height_cm and weight_kgs have a mild positive correlation with one another, strength is almost twice as correlated to weight_kgs ($r \approx 0.61$) as it is to height_cm.
III. Data Engineering¶
# Initialize df, our "clean" df, using non-sparse columns from FIFA
full_cols = sorted(list(set(fifa.columns.tolist()) - set(missing_all["index"].tolist())))
df = fifa.loc[:, full_cols]
# Reduce integers to a 32-bit format
for col in df.select_dtypes(include="int64"):
df[col] = df[col].astype("int32")
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 51 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 GK_diving 17954 non-null int32 1 GK_handling 17954 non-null int32 2 GK_kicking 17954 non-null int32 3 GK_positioning 17954 non-null int32 4 GK_reflexes 17954 non-null int32 5 acceleration 17954 non-null int32 6 age 17954 non-null int32 7 aggression 17954 non-null int32 8 agility 17954 non-null int32 9 balance 17954 non-null int32 10 ball_control 17954 non-null int32 11 birth_date 17954 non-null datetime64[ns] 12 body_type 17954 non-null object 13 composure 17954 non-null int32 14 crossing 17954 non-null int32 15 curve 17954 non-null int32 16 dribbling 17954 non-null int32 17 finishing 17954 non-null int32 18 freekick_accuracy 17954 non-null int32 19 full_name 17954 non-null object 20 heading_accuracy 17954 non-null int32 21 height_cm 17954 non-null float64 22 id 17954 non-null int32 23 interceptions 17954 non-null int32 24 international_reputation(1-5) 17954 non-null int32 25 jumping 17954 non-null int32 26 long_passing 17954 non-null int32 27 long_shots 17954 non-null int32 28 marking 17954 non-null int32 29 name 17954 non-null object 30 nationality 17954 non-null object 31 overall_rating 17954 non-null int32 32 penalties 17954 non-null int32 33 positioning 17954 non-null int32 34 positions 17954 non-null object 35 potential 17954 non-null int32 36 preferred_foot 17954 non-null object 37 reactions 17954 non-null int32 38 short_passing 17954 non-null int32 39 shot_power 17954 non-null int32 40 skill_moves(1-5) 17954 non-null int32 41 sliding_tackle 17954 non-null int32 42 sprint_speed 17954 non-null int32 43 stamina 17954 non-null int32 44 standing_tackle 17954 non-null int32 45 strength 17954 non-null int32 46 vision 17954 non-null int32 47 volleys 17954 non-null int32 48 weak_foot(1-5) 17954 non-null int32 49 weight_kgs 17954 non-null float64 50 work_rate 17954 non-null object dtypes: datetime64[ns](1), float64(2), int32(41), object(7) memory usage: 4.2+ MB
# Sus out columns that are missing less than or equal to 15% of their entries.
temp = missing_all["cnt_na_vals"] <= 0.15 * fifa.shape[0]
cleaning_window = fifa.loc[:,
missing_all.loc[temp].reset_index(drop=True)[
"index"].values.tolist()]
cleaning_window.head(10)
| LM | LCM | CM | RCM | RM | LWB | LDM | CF | CAM | CDM | ... | RF | club_join_date | release_clause_euro | contract_end_year | value_euro | wage_euro | club_jersey_number | club_position | club_rating | club_team | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 91+2 | 85+2 | 85+2 | 85+2 | 91+2 | 64+2 | 61+2 | 93+2 | 93+2 | 61+2 | ... | 93+2 | 2004-07-01 | 226500000.0 | 2021 | 110500000.0 | 565000.0 | 10.0 | RW | 86.0 | FC Barcelona |
| 1 | 86+3 | 85+3 | 85+3 | 85+3 | 86+3 | 71+3 | 71+3 | 84+3 | 86+3 | 71+3 | ... | 84+3 | 2013-08-30 | 133800000.0 | 2020 | 69500000.0 | 205000.0 | 23.0 | LCM | 83.0 | Tottenham Hotspur |
| 2 | 83+3 | 84+3 | 84+3 | 84+3 | 83+3 | 76+3 | 77+3 | 83+3 | 84+3 | 77+3 | ... | 83+3 | 2016-08-09 | 144200000.0 | 2021 | 73000000.0 | 255000.0 | 6.0 | LCM | 82.0 | Manchester United |
| 3 | 86+3 | 78+3 | 78+3 | 78+3 | 86+3 | 63+3 | 58+3 | 85+3 | 86+3 | 58+3 | ... | 85+3 | 2010-07-01 | 105400000.0 | 2022 | 62000000.0 | 165000.0 | 24.0 | LS | 82.0 | Napoli |
| 4 | 57+3 | 61+3 | 61+3 | 61+3 | 57+3 | 73+3 | 77+3 | 54+3 | 55+3 | 77+3 | ... | 54+3 | 2014-07-01 | 106500000.0 | 2021 | 60000000.0 | 135000.0 | 26.0 | LCB | 82.0 | Napoli |
| 5 | 68+3 | 73+3 | 73+3 | 73+3 | 68+3 | 78+3 | 82+3 | 67+3 | 68+3 | 82+3 | ... | 67+3 | 2018-01-01 | 114500000.0 | 2023 | 59500000.0 | 215000.0 | 4.0 | LCB | 83.0 | Liverpool |
| 6 | 86+3 | 78+3 | 78+3 | 78+3 | 86+3 | 66+3 | 62+3 | 87+3 | 86+3 | 62+3 | ... | 87+3 | 2018-07-01 | 166100000.0 | 2022 | 81000000.0 | 100000.0 | 7.0 | LS | 84.0 | Paris Saint-Germain |
| 7 | 83+3 | 76+3 | 76+3 | 76+3 | 83+3 | 58+3 | 56+3 | 87+3 | 85+3 | 56+3 | ... | 87+3 | 2011-07-28 | 119300000.0 | 2021 | 64500000.0 | 300000.0 | 10.0 | ST | 85.0 | Manchester City |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 2011-07-01 | 62700000.0 | 2021 | 38000000.0 | 130000.0 | 1.0 | GK | 85.0 | FC Bayern München |
| 9 | 79+3 | 75+3 | 75+3 | 75+3 | 79+3 | 67+3 | 65+3 | 83+3 | 80+3 | 65+3 | ... | 83+3 | 2013-07-16 | 111000000.0 | 2020 | 60000000.0 | 200000.0 | 9.0 | RS | 84.0 | Paris Saint-Germain |
10 rows × 35 columns
cleaning_window.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LM 15889 non-null object 1 LCM 15889 non-null object 2 CM 15889 non-null object 3 RCM 15889 non-null object 4 RM 15889 non-null object 5 LWB 15889 non-null object 6 LDM 15889 non-null object 7 CF 15889 non-null object 8 CAM 15889 non-null object 9 CDM 15889 non-null object 10 RDM 15889 non-null object 11 RWB 15889 non-null object 12 LB 15889 non-null object 13 LCB 15889 non-null object 14 CB 15889 non-null object 15 RCB 15889 non-null object 16 RAM 15889 non-null object 17 RB 15889 non-null object 18 LAM 15889 non-null object 19 LW 15889 non-null object 20 RW 15889 non-null object 21 ST 15889 non-null object 22 RS 15889 non-null object 23 LS 15889 non-null object 24 LF 15889 non-null object 25 RF 15889 non-null object 26 club_join_date 16018 non-null datetime64[ns] 27 release_clause_euro 16117 non-null float64 28 contract_end_year 17593 non-null object 29 value_euro 17699 non-null float64 30 wage_euro 17708 non-null float64 31 club_jersey_number 17940 non-null float64 32 club_position 17940 non-null object 33 club_rating 17940 non-null float64 34 club_team 17940 non-null object dtypes: datetime64[ns](1), float64(5), object(29) memory usage: 4.8+ MB
cleaning_window["LM"].astype("str")
0 91+2
1 86+3
2 83+3
3 86+3
4 57+3
...
17949 66+2
17950 50+2
17951 nan
17952 56+2
17953 52+2
Name: LM, Length: 17954, dtype: object
def calc_from_str(inString):
"""Calculate a sum from a string"""
val = 0
if inString.strip().lower() == "nan":
val = np.nan
elif "+" in inString:
a, b = inString.strip().split("+")
val = int(a) + int(b)
else:
val = int(inString)
return val
# Loop through selected variables, applying the transformation to numeric
temp_df = cleaning_window.loc[:, "LM":"RF"]
for col in temp_df.columns.tolist():
# Perform the calculation
temp_df[col] = temp_df[col].astype("str").apply(calc_from_str)
# Use simple mean imputation
means_dict = {col:temp_df[col].mean().round(0) for col in temp_df.columns.tolist()}
temp_df = temp_df.fillna(value=means_dict).astype("int32")
# Check work
temp_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LM 17954 non-null int32 1 LCM 17954 non-null int32 2 CM 17954 non-null int32 3 RCM 17954 non-null int32 4 RM 17954 non-null int32 5 LWB 17954 non-null int32 6 LDM 17954 non-null int32 7 CF 17954 non-null int32 8 CAM 17954 non-null int32 9 CDM 17954 non-null int32 10 RDM 17954 non-null int32 11 RWB 17954 non-null int32 12 LB 17954 non-null int32 13 LCB 17954 non-null int32 14 CB 17954 non-null int32 15 RCB 17954 non-null int32 16 RAM 17954 non-null int32 17 RB 17954 non-null int32 18 LAM 17954 non-null int32 19 LW 17954 non-null int32 20 RW 17954 non-null int32 21 ST 17954 non-null int32 22 RS 17954 non-null int32 23 LS 17954 non-null int32 24 LF 17954 non-null int32 25 RF 17954 non-null int32 dtypes: int32(26) memory usage: 1.8 MB
# Push engineered columns to df
df = pd.concat(objs=[df, temp_df], axis=1)
# Remove engineered columns from cleaning_window
cleaning_window = cleaning_window.drop(labels=temp_df.columns.tolist(), axis=1)
# Check what's left
cleaning_window.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 club_join_date 16018 non-null datetime64[ns] 1 release_clause_euro 16117 non-null float64 2 contract_end_year 17593 non-null object 3 value_euro 17699 non-null float64 4 wage_euro 17708 non-null float64 5 club_jersey_number 17940 non-null float64 6 club_position 17940 non-null object 7 club_rating 17940 non-null float64 8 club_team 17940 non-null object dtypes: datetime64[ns](1), float64(5), object(3) memory usage: 1.2+ MB
cleaning_window.head(10)
| club_join_date | release_clause_euro | contract_end_year | value_euro | wage_euro | club_jersey_number | club_position | club_rating | club_team | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-07-01 | 226500000.0 | 2021 | 110500000.0 | 565000.0 | 10.0 | RW | 86.0 | FC Barcelona |
| 1 | 2013-08-30 | 133800000.0 | 2020 | 69500000.0 | 205000.0 | 23.0 | LCM | 83.0 | Tottenham Hotspur |
| 2 | 2016-08-09 | 144200000.0 | 2021 | 73000000.0 | 255000.0 | 6.0 | LCM | 82.0 | Manchester United |
| 3 | 2010-07-01 | 105400000.0 | 2022 | 62000000.0 | 165000.0 | 24.0 | LS | 82.0 | Napoli |
| 4 | 2014-07-01 | 106500000.0 | 2021 | 60000000.0 | 135000.0 | 26.0 | LCB | 82.0 | Napoli |
| 5 | 2018-01-01 | 114500000.0 | 2023 | 59500000.0 | 215000.0 | 4.0 | LCB | 83.0 | Liverpool |
| 6 | 2018-07-01 | 166100000.0 | 2022 | 81000000.0 | 100000.0 | 7.0 | LS | 84.0 | Paris Saint-Germain |
| 7 | 2011-07-28 | 119300000.0 | 2021 | 64500000.0 | 300000.0 | 10.0 | ST | 85.0 | Manchester City |
| 8 | 2011-07-01 | 62700000.0 | 2021 | 38000000.0 | 130000.0 | 1.0 | GK | 85.0 | FC Bayern München |
| 9 | 2013-07-16 | 111000000.0 | 2020 | 60000000.0 | 200000.0 | 9.0 | RS | 84.0 | Paris Saint-Germain |
# Use context to determine fill value policies, as appropriate
cleaning_window["club_jersey_number"] = cleaning_window["club_jersey_number"].fillna(value=0).astype("int32")
cleaning_window["club_position"] = cleaning_window["club_position"].fillna(value="Unknown").astype("str")
cleaning_window["club_team"] = cleaning_window["club_team"].fillna(value="Unknown").astype("str")
# Subset engineered columns
temp = ["club_jersey_number", "club_position", "club_team"]
temp_df = cleaning_window.loc[:, temp]
# Push engineered columns to df
df = pd.concat(objs=[df, temp_df], axis=1)
# Remove engineered columns from cleaning_window
cleaning_window = cleaning_window.drop(labels=temp, axis=1)
# Check what's left
cleaning_window.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 club_join_date 16018 non-null datetime64[ns] 1 release_clause_euro 16117 non-null float64 2 contract_end_year 17593 non-null object 3 value_euro 17699 non-null float64 4 wage_euro 17708 non-null float64 5 club_rating 17940 non-null float64 dtypes: datetime64[ns](1), float64(4), object(1) memory usage: 841.7+ KB
# Compare to other numeric features to gain insight
temp_df = pd.concat(objs=[df, cleaning_window], axis=1).select_dtypes(exclude=["object", "datetime64[ns]"]).dropna()
temp_corr = temp_df.corr().abs().round(2)
temp_corr["value_euro"].reset_index().sort_values(by="value_euro", ascending=False).head(10)
| index | value_euro | |
|---|---|---|
| 71 | value_euro | 1.00 |
| 70 | release_clause_euro | 0.99 |
| 72 | wage_euro | 0.86 |
| 21 | international_reputation(1-5) | 0.65 |
| 26 | overall_rating | 0.63 |
| 29 | potential | 0.58 |
| 30 | reactions | 0.55 |
| 73 | club_rating | 0.54 |
| 45 | CM | 0.49 |
| 44 | LCM | 0.49 |
Here, we can see that value euro is most closely associated with release_clause_euro and wage_euro. However, international_rating(1-5), overall_rating, and potential also have some mild association with value, as well. Let's try using full columns to predict individual observations in the feature space of cleaning_window. Scikit-learn's experimental IterativeImputer object utilizes Bayesian Ridge Regression to iteratively impute missing numerical values from all others.
# Join predictive and sparse variables in imputable_df
pred_df = df.loc[:, ["international_reputation(1-5)", "overall_rating", "potential"]]
imputable_df = pd.concat(objs=[cleaning_window, pred_df], axis=1).drop(labels=["club_join_date", "contract_end_year"], axis=1)
imputable_df
| release_clause_euro | value_euro | wage_euro | club_rating | international_reputation(1-5) | overall_rating | potential | |
|---|---|---|---|---|---|---|---|
| 0 | 226500000.0 | 110500000.0 | 565000.0 | 86.0 | 5 | 94 | 94 |
| 1 | 133800000.0 | 69500000.0 | 205000.0 | 83.0 | 3 | 88 | 89 |
| 2 | 144200000.0 | 73000000.0 | 255000.0 | 82.0 | 4 | 88 | 91 |
| 3 | 105400000.0 | 62000000.0 | 165000.0 | 82.0 | 3 | 88 | 88 |
| 4 | 106500000.0 | 60000000.0 | 135000.0 | 82.0 | 3 | 88 | 91 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 17949 | 1800000.0 | 975000.0 | 3000.0 | 67.0 | 1 | 67 | 70 |
| 17950 | 295000.0 | 190000.0 | 1000.0 | 65.0 | 1 | 59 | 67 |
| 17951 | 289000.0 | 170000.0 | 1000.0 | 69.0 | 1 | 59 | 67 |
| 17952 | 532000.0 | 280000.0 | 4000.0 | 74.0 | 1 | 59 | 71 |
| 17953 | 626000.0 | 325000.0 | 2000.0 | 71.0 | 1 | 59 | 75 |
17954 rows × 7 columns
# Save pre-image of feature statistics for later validation
pre_impute_stats = imputable_df.iloc[:, :4].describe()
# Impute iteratively
imputer = IterativeImputer(random_state=2025)
imputed_data = imputer.fit_transform(imputable_df)
# Convert back to DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=imputable_df.columns)
imputed_df
| release_clause_euro | value_euro | wage_euro | club_rating | international_reputation(1-5) | overall_rating | potential | |
|---|---|---|---|---|---|---|---|
| 0 | 226500000.0 | 110500000.0 | 565000.0 | 86.0 | 5.0 | 94.0 | 94.0 |
| 1 | 133800000.0 | 69500000.0 | 205000.0 | 83.0 | 3.0 | 88.0 | 89.0 |
| 2 | 144200000.0 | 73000000.0 | 255000.0 | 82.0 | 4.0 | 88.0 | 91.0 |
| 3 | 105400000.0 | 62000000.0 | 165000.0 | 82.0 | 3.0 | 88.0 | 88.0 |
| 4 | 106500000.0 | 60000000.0 | 135000.0 | 82.0 | 3.0 | 88.0 | 91.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 17949 | 1800000.0 | 975000.0 | 3000.0 | 67.0 | 1.0 | 67.0 | 70.0 |
| 17950 | 295000.0 | 190000.0 | 1000.0 | 65.0 | 1.0 | 59.0 | 67.0 |
| 17951 | 289000.0 | 170000.0 | 1000.0 | 69.0 | 1.0 | 59.0 | 67.0 |
| 17952 | 532000.0 | 280000.0 | 4000.0 | 74.0 | 1.0 | 59.0 | 71.0 |
| 17953 | 626000.0 | 325000.0 | 2000.0 | 71.0 | 1.0 | 59.0 | 75.0 |
17954 rows × 7 columns
# Save post-image feature statistics
post_impute_stats = imputed_df.iloc[:, :4].describe()
# Examine the relative change resulting from iterative imputation
rel_change_stats = ((post_impute_stats - pre_impute_stats) / pre_impute_stats).abs().round(2).iloc[1:, :]
rel_change_stats
| release_clause_euro | value_euro | wage_euro | club_rating | |
|---|---|---|---|---|
| mean | 0.01 | 0.00 | 0.00 | 0.0 |
| std | 0.03 | 0.01 | 0.01 | 0.0 |
| min | 3.30 | 0.00 | 0.00 | 0.0 |
| 25% | 0.06 | 0.00 | 0.00 | 0.0 |
| 50% | 0.02 | 0.04 | 0.00 | 0.0 |
| 75% | 0.14 | 0.10 | 0.10 | 0.0 |
| max | 0.00 | 0.00 | 0.00 | 0.0 |
While there is a significant increase in the minimum release clause, relative changes to the remaining variables are somewhat minimal.
# Push engineered columns to df
temp = rel_change_stats.columns.tolist()
temp_df = imputed_df.loc[:, temp]
df = pd.concat(objs=[df, temp_df], axis=1)
# Remove engineered columns from cleaning_window
cleaning_window = cleaning_window.drop(labels=temp, axis=1)
# Check what's left
cleaning_window.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 club_join_date 16018 non-null datetime64[ns] 1 contract_end_year 17593 non-null object dtypes: datetime64[ns](1), object(1) memory usage: 280.7+ KB
cleaning_window["contract_end_year"].unique()
array(['2021', '2020', '2022', '2023', '2024', 'Jun 30, 2019', '2025',
nan, '2019', 'May 31, 2020', 'Jun 30, 2020', 'Dec 31, 2019',
'May 31, 2019', 'Jan 1, 2020', 'May 21, 2019', 'Jan 31, 2020',
'Feb 27, 2020', 'Dec 1, 2019', 'Oct 14, 2019', 'Dec 31, 2018',
'Mar 1, 2019', 'Jun 1, 2020', 'Jul 2, 2019', 'Dec 1, 2020', '2026'],
dtype=object)
# Force entry conformity
end_year_dict = {
'2021': 2021,
'2020': 2020,
'2022': 2022,
'2023': 2023,
'2024': 2024,
'Jun 30, 2019': 2019,
'2025': 2025,
'nan': np.nan,
'2019': 2019,
'May 31, 2020': 2020,
'Jun 30, 2020': 2020,
'Dec 31, 2019': 2020,
'May 31, 2019': 2019,
'Jan 1, 2020': 2020,
'May 21, 2019': 2019,
'Jan 31, 2020': 2020,
'Feb 27, 2020': 2020,
'Dec 1, 2019': 2019,
'Oct 14, 2019': 2019,
'Dec 31, 2018': 2019,
'Mar 1, 2019': 2019,
'Jun 1, 2020': 2020,
'Jul 2, 2019': 2019,
'Dec 1, 2020': 2020,
'2026': 2026
}
cleaning_window["contract_end_year"] = cleaning_window["contract_end_year"].map(end_year_dict)
# Visualize contracts
sns.histplot(data=cleaning_window, x="contract_end_year", color="green", shrink=4)
plt.xlabel("Year")
plt.ylabel("Number of Contracts Expiring")
plt.title("Figure 06: Contract Expiry Distribution")
plt.tight_layout()
plt.show()
cleaning_window.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 club_join_date 16018 non-null datetime64[ns] 1 contract_end_year 17593 non-null float64 dtypes: datetime64[ns](1), float64(1) memory usage: 280.7 KB
# Delete temporary variables
#del temp
#del temp_df
#del temp_corr
#del imputed_data
#del imputed_df
# Coerce datatypes
df["release_clause_euro"] = df["release_clause_euro"].astype("int32")
df["value_euro"] = df["value_euro"].astype("int32")
df["wage_euro"] = df["wage_euro"].astype("int32")
# Inspect our dataframe
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17954 entries, 0 to 17953 Data columns (total 84 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 GK_diving 17954 non-null int32 1 GK_handling 17954 non-null int32 2 GK_kicking 17954 non-null int32 3 GK_positioning 17954 non-null int32 4 GK_reflexes 17954 non-null int32 5 acceleration 17954 non-null int32 6 age 17954 non-null int32 7 aggression 17954 non-null int32 8 agility 17954 non-null int32 9 balance 17954 non-null int32 10 ball_control 17954 non-null int32 11 birth_date 17954 non-null datetime64[ns] 12 body_type 17954 non-null object 13 composure 17954 non-null int32 14 crossing 17954 non-null int32 15 curve 17954 non-null int32 16 dribbling 17954 non-null int32 17 finishing 17954 non-null int32 18 freekick_accuracy 17954 non-null int32 19 full_name 17954 non-null object 20 heading_accuracy 17954 non-null int32 21 height_cm 17954 non-null float64 22 id 17954 non-null int32 23 interceptions 17954 non-null int32 24 international_reputation(1-5) 17954 non-null int32 25 jumping 17954 non-null int32 26 long_passing 17954 non-null int32 27 long_shots 17954 non-null int32 28 marking 17954 non-null int32 29 name 17954 non-null object 30 nationality 17954 non-null object 31 overall_rating 17954 non-null int32 32 penalties 17954 non-null int32 33 positioning 17954 non-null int32 34 positions 17954 non-null object 35 potential 17954 non-null int32 36 preferred_foot 17954 non-null object 37 reactions 17954 non-null int32 38 short_passing 17954 non-null int32 39 shot_power 17954 non-null int32 40 skill_moves(1-5) 17954 non-null int32 41 sliding_tackle 17954 non-null int32 42 sprint_speed 17954 non-null int32 43 stamina 17954 non-null int32 44 standing_tackle 17954 non-null int32 45 strength 17954 non-null int32 46 vision 17954 non-null int32 47 volleys 17954 non-null int32 48 weak_foot(1-5) 17954 non-null int32 49 weight_kgs 17954 non-null float64 50 work_rate 17954 non-null object 51 LM 17954 non-null int32 52 LCM 17954 non-null int32 53 CM 17954 non-null int32 54 RCM 17954 non-null int32 55 RM 17954 non-null int32 56 LWB 17954 non-null int32 57 LDM 17954 non-null int32 58 CF 17954 non-null int32 59 CAM 17954 non-null int32 60 CDM 17954 non-null int32 61 RDM 17954 non-null int32 62 RWB 17954 non-null int32 63 LB 17954 non-null int32 64 LCB 17954 non-null int32 65 CB 17954 non-null int32 66 RCB 17954 non-null int32 67 RAM 17954 non-null int32 68 RB 17954 non-null int32 69 LAM 17954 non-null int32 70 LW 17954 non-null int32 71 RW 17954 non-null int32 72 ST 17954 non-null int32 73 RS 17954 non-null int32 74 LS 17954 non-null int32 75 LF 17954 non-null int32 76 RF 17954 non-null int32 77 club_jersey_number 17954 non-null int32 78 club_position 17954 non-null object 79 club_team 17954 non-null object 80 release_clause_euro 17954 non-null int32 81 value_euro 17954 non-null int32 82 wage_euro 17954 non-null int32 83 club_rating 17954 non-null float64 dtypes: datetime64[ns](1), float64(3), int32(71), object(9) memory usage: 6.6+ MB
# Subset numerically interpretable columns from df
X = df.select_dtypes(exclude=['datetime64[ns]', 'object'])
# print(f"X Shape: {X.shape}\n")
# Calculate Pearson correlation matrix
corr = X.corr()
# Hierarchical Clustering with Farthest Neighbor method
plt.figure(figsize=(12,6))
dissimilarity = 1 - corr.abs()
Z = linkage(y=squareform(dissimilarity), method='complete')
# Visualize feature hierarchies
dendrogram(Z, labels=X.columns, orientation='top', leaf_rotation=90)
plt.title("Figure 07: Hierarchical Clustering Dendrogram using Complete Linkage")
plt.xlabel("Component Features")
plt.tight_layout()
plt.show()
The above dendrogram is representative of complete-linkage hierarchical clustering. Each color corresponds to a specific cluster. For example, agression, marking, interceptions are colored orange and grouped with other offensive features, such as stamina, GK_kicking, etc. Meanwhile, monetary features like wage_euro and rankings like international_ratings(1-5) eventually join in with traits like composure. While the code snippet originally used a threshold of 0.8, we modified this to reflect the psuedo-minimum cluster height. Within our features, the "orange" cluster reaches a height of 0.5, so we choose that as our threshold below.
# Clusterize the data
threshold = 0.5
labels = fcluster(Z, threshold, criterion='distance')
# Keep the indices to sort labels
labels_order = np.argsort(labels)
# Build a new dataframe with the sorted columns; refactored to avoid looping
clustered = X[X.columns[labels_order]]
corr_new = clustered.corr()
# Generate custom text for hover
hover_text = [
[f"x: {corr_new.columns[j]};<br>y: {corr_new.index[i]};<br>r: {corr_new.iloc[i, j]:.2f}"
for j in range(corr_new.shape[1])]
for i in range(corr_new.shape[0])
]
# Create heatmap
fig = go.Figure(data=go.Heatmap(
z=corr_new.values,
x=corr_new.columns,
y=corr_new.index,
colorscale='RdBu',
zmin=-1,
zmax=1,
text=hover_text,
hoverinfo='text'
))
# Layout adjustments
fig.update_layout(
title="Correlation Matrix",
width=1000,
height=1000
)
# Show interactive plot
fig.show()
IV. Insights¶
Utilizing linkage criteria, we reordered our numerical features, and thus our correlation matrix, to position related features closer to one another in terms of their indices. This mimics the functionality of method='hclust within RLang's corrplot package. The use of Plotly over Seaborn in this case allows for interactive objects, which also save space since we don't have to specify annot within sns.heatmap. Instead, we simply hover over the cells to access the details. By examining the above heatmap, we can quickly make a few observations:
- As the "orange" cluster of dendrogram suggested,
aggressionis most closely associated with interceptions, tackles, and marking; since r >= 0.72, we can classify these features as having a strong positive correlation with aggression. - The features
GK_kicking,GK_reflexes, andGK_handlingshare incredibly strong positive correlations (r=> 0.97). This unusually high association suggests these variables may measure the same underlying quantity. - Meanwhile, various attacking positions, such as Centre Attacking Middle, Left-Winger, and Center Forward, show a strong positive correlation. While some positions show r = 0.99, others remain highly correlated but distinct. Indeed, this indicates yet again that we may have redundant variables.
- Interestingly,
staminahas strong negative correlations withGK_handling,GK_reflexes, andGK_kicking, among others. Players with high stamina tend to be great at dribbling, ball control, and short passes, but may struggle more with precise movements. One could investigate the difference between endurance-based support players with high stamina and energetic burst players. - Some features, such as
jumping,weak_foot, andcontract_year_enddo not seem to be meaningfully associated with many other variables, if at all.
V. Suggestions: Cluster Analysis and Hypothesis Testing¶
I would have preferred to progress into unsupervised ML from a more abstract point. In particular, after scaling the data, PCA should allow for reducing features while maintaining variability. This could be used to consider highly correlated features in their more abstract forms. Following PCA, k-means cluster analysis with k = 5 would allow for aggregate comparison and potentially rich storytelling. Additionally, I would be interested in utilizing the national subset aggregates to test hypotheses regarding differences in various performance metrics. Although scipy.stats has many of these capabilities, the pingouin package includes more explicit labeling and a neater output.